5.11 Post-Training Embedding Binarization for Fast Online Top-K Passage Matching
To lower the complexity of BERT, the recent state-of-the-art model ColBERT [113] employs a contextualized late interaction paradigm to independently learn fine-grained query-passage representations. It comprises (1) a query encoder $f_Q$, (2) a passage encoder $f_D$, and (3) a query-passage score predictor. Specifically, given a query $q$ and a passage $d$, $f_Q$ and $f_D$ encode them into bags of fixed-size embeddings $E_q$ and $E_d$ as follows:
$$E_q = \mathrm{Normalize}\big(\mathrm{CNN}\big(\mathrm{BERT}(\text{``[Q]}\,q_0 q_1 \cdots q_l\text{''})\big)\big),$$
$$E_d = \mathrm{Filter}\big(\mathrm{Normalize}\big(\mathrm{CNN}\big(\mathrm{BERT}(\text{``[D]}\,d_0 d_1 \cdots d_n\text{''})\big)\big)\big), \tag{5.44}$$
where $q$ and $d$ are tokenized into tokens $q_0 q_1 \cdots q_l$ and $d_0 d_1 \cdots d_n$ by the BERT-based WordPiece tokenizer, respectively.
respectively. [Q] and [D] indicate the sequence types.
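The score predictor is the remaining component: in ColBERT's late interaction, each query token embedding is matched against its most similar passage token embedding, and these maximum similarities are summed (the MaxSim operator). Below is a minimal PyTorch sketch of this scoring; the random bags merely stand in for actual $f_Q$/$f_D$ outputs, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def maxsim_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Late-interaction relevance score between a query and a passage.

    E_q: (l, dim) bag of L2-normalized query token embeddings.
    E_d: (n, dim) bag of L2-normalized passage token embeddings.
    """
    sim = E_q @ E_d.T                   # (l, n) cosine similarities
    return sim.max(dim=1).values.sum()  # max over passage tokens, summed over query tokens

# Toy usage: random bags standing in for the outputs of f_Q and f_D.
torch.manual_seed(0)
E_q = F.normalize(torch.randn(32, 128), dim=-1)   # hypothetical query bag
E_d = F.normalize(torch.randn(180, 128), dim=-1)  # hypothetical passage bag
print(maxsim_score(E_q, E_d))
```

For online top-K matching, this score is computed between one query bag and each candidate passage bag, and the passages with the K largest scores are returned.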
Despite the advances of ColBERT over the vanilla BERT model, its massive computation and parameter burden still hinder deployment on edge devices. Recently, Chen et al. [40] proposed Bi-ColBERT, which binarizes the embeddings to relieve the computation burden. Bi-ColBERT involves (1) semantic diffusion to hedge the information loss caused by embedding binarization, and (2) an approximation of the unit impulse function [18] for more accurate gradient estimation.
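The motivation for the second ingredient is that $\mathrm{sign}(\cdot)$ has a zero derivative almost everywhere (its true derivative is the unit impulse $2\delta(x)$), so training through it requires a surrogate gradient. The sketch below illustrates this idea in PyTorch, using a Gaussian bump to approximate the impulse in the backward pass; the surrogate shape and width are illustrative assumptions, not necessarily the exact approximation used in Bi-ColBERT [40].

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign(.) in the forward pass, smooth surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Gaussian approximation of the unit impulse (the width sigma is a guess).
        sigma = 0.4
        impulse = torch.exp(-x.pow(2) / (2 * sigma ** 2)) / (sigma * (2 * torch.pi) ** 0.5)
        return grad_out * impulse

def binarize(x: torch.Tensor) -> torch.Tensor:
    return BinarizeSTE.apply(x)

# Gradients now flow through the surrogate instead of the zero-a.e. true derivative.
e = torch.randn(4, 8, requires_grad=True)
binarize(e).sum().backward()
print(e.grad.abs().mean())
```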
5.11.1 Semantic Diffusion
Binarization with $\mathrm{sign}(\cdot)$ inevitably smooths the embedding informativeness into the binarized space $\{-1, 1\}^d$, regardless of the original values. Intuitively, one therefore wants to avoid condensing the informative latent semantics into (relatively small) sub-structures of the embedding bags. In other words, the aim is to diffuse the embedded semantics across all embedding dimensions, as one effective strategy to hedge the inevitable information loss caused by numerical binarization and to retain as much of the semantic uniqueness after binarization as possible.
Recall that in the singular value decomposition (SVD), the singular values and singular vectors reconstruct the original matrix; large singular values are normally interpreted as being associated with the major semantic structures of the matrix [242]. To achieve semantic diffusion by normalizing the singular values, thereby equalizing their respective contributions to the latent semantics, the authors introduced the following lightweight semantic diffusion technique.
Concretely, let $I$ denote the identity matrix and let $p^{(0)} \in \mathbb{R}^d$ be a standard normal random vector. During training, the diffusion vector is iteratively updated as $p^{(h)} = E_q^T E_q \, p^{(h-1)}$. Then, the projection matrix $P_q$ is obtained via:
$$P_q = \frac{p^{(h)} {p^{(h)}}^T}{\|p^{(h)}\|_2^2}. \tag{5.45}$$
The semantic-diffused embedding is then computed with a hyper-parameter $\epsilon \in (0, 1)$ as:
$$\hat{E}_q = E_q (I - \epsilon P_q). \tag{5.46}$$
Compared to the unprocessed embedding bag $E_q$, the diffused embedding $\hat{E}_q$ presents a more balanced spectrum (distribution of singular values) in expectation.
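The full diffusion step is easy to sketch. The code below follows Eqs. (5.45)-(5.46) with a fixed number of power-iteration updates; the renormalization of $p^{(h)}$ is an added numerical-stability choice (the projection in Eq. (5.45) is scale-invariant, so it does not change $P_q$), and the iteration count and $\epsilon$ are illustrative.

```python
import torch

def semantic_diffusion(E_q: torch.Tensor, eps: float = 0.5, n_iter: int = 3) -> torch.Tensor:
    """Semantic diffusion of an embedding bag E_q of shape (tokens, dim)."""
    d = E_q.shape[1]
    p = torch.randn(d, 1)                    # p^(0): standard normal random vector
    for _ in range(n_iter):
        p = E_q.T @ (E_q @ p)                # p^(h) = E_q^T E_q p^(h-1)
        p = p / p.norm()                     # renormalize for numerical stability
    P_q = (p @ p.T) / p.pow(2).sum()         # Eq. (5.45): rank-1 projection matrix
    return E_q @ (torch.eye(d) - eps * P_q)  # Eq. (5.46): damp the dominant direction

# The iteration drives p toward the top right singular vector of E_q, so the
# diffusion shrinks the largest singular value by roughly a factor of (1 - eps),
# flattening the spectrum.
E_q = torch.randn(32, 128)
E_hat = semantic_diffusion(E_q)
print(torch.linalg.svdvals(E_q)[0], torch.linalg.svdvals(E_hat)[0])
```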